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Abstract. This paper presents a multigrid algorithm for the computation of the rank-R canoni- 
cal decomposition of a tensor for low rank R. Standard alternating least squares (ALS) is used as the 
relaxation method. Transfer operators and coarse-level tensors are constructed in an adaptive setup 
phase based on multiplicative correction and on Bootstrap algebraic multigrid. An accurate solution 
is then computed by an additive solve phase based on the Full Approximation Scheme. Numerical 
tests show that for certain test problems the multilevel method significantly outperforms standalone 
ALS when a high level of accuracy is required. 
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1. Introduction. In this paper we present a multigrid method for accurately 
computing a low-rank canonical decomposition of a tensor. An iVth-order tensor is 
an iV-dimensional array of size I\ x • • • x Jjv [20] . The order of a tensor is the number 
of modes (dimensions), and the size of the nth mode is /„ for n = 1,...,N. The 
canonical tensor decomposition is a higher-order generalization of the matrix singular 
value decomposition (SVD) in that it decomposes a tensor as a sum of rank-one 
components. For example, let T be an arbitrary A^th-order tensor of rank R, meaning 
that it can be expressed as a sum of no fewer than R rank-one components. Then its 
canonical decomposition is given by 



aWc-oaW (1.1) 



where o denotes the vector outer product. The rth rank-one component is formed by 
taking the vector outer product of N column vectors ai™^ € M. 1 "- for n = 1, . . . , N. 
For each mode n = 1,...,N, one can store the vectors a£ , . . . , as the columns 
of an I n x R matrix A™. The matrix A^™* 1 is referred to as the mode-n factor matrix 
and its columns are the mode-n factors. The canonical tensor decomposition may 
then be expressed in terms of the factor matrices by the double bracket notation 
[A^, . . . , AW], which is defined as the summation in ( 1.1 1. 



We refer to the canonical decomposition as CANDECOMP /PARAFAC (CP) after 



the names originally given to it in early papers on the subject [T2], [13] . Whereas (1.1 1 
is an example of an exact CP decomposition, referred to as the rank decomposition 
since R = rank(T), our goal is to find a CP decomposition for an arbitrary TVth- 
order tensor %> and a given number of components R. Depending on R the CP 
decomposition may only approximate 2>, in which case it is natural to look for the 
"best" such approximation in some sense. For example, this problem can be made 
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more concrete by posing it as an optimization problem: 

Z- [A a> A'M 



minimize /(A (1) , . . . , A (JV) ) := - % - [[A (1) , . . . , A (Ar) ] 2 . (1.2) 



In words, we seek the factor matrices that minimize the functional /. Here, || • || 
denotes the Frobenius norm of a tensor, defined as the square root of the sum of the 
squared entries of all tensor elements. In what follows, for matrices || • || refers to the 
Frobenius norm, and for vectors it refers to the vector two-norm. A general approach 



to solving the optimization problem (1.2) is to find a set of nontrivial factor matrices 
that zero out the gradient of /. In other words, at any local minimum of (1.2 1 the 
first-order optimality equations must be satisfied: 

aXR=° ^n = l,...,N. (1.3) 

In PQ, Acar, Dunlavy and Kolda propose solving the first-order optimality equations 
by applying gradient-based optimization methods such as the nonlinear conjugate 
gradient method together with a line search. Along the same vein, a new method for 
computing the CP decomposition was recently developed by combining a nonlinear 
generalized minimal residual method with a line search [30]. In this paper we follow 
suit by solving the first-order optimality equations using a multigrid approach. 



The optimization problem (1.2) is non-convex, consequently, it may admit mul- 
tiple local minima. Moreover, for any local minimizer there is a continuous manifold 
of equivalent minimizers |20j . This manifold arises because of a scaling indetermi- 
nacy inherent to the CP decomposition, i.e., the individual factors composing each 
rank-one term can be rescaled without changing the rank-one term. The CP decom- 
position also exhibits a permutation indeterminacy in that the rank-one component 
tensors can be reordered arbitrarily |20j . In the following sections we discuss how 
the scaling and permutation indeterminacies can be removed by imposing a specific 
normalization and ordering of the factors. However, the CP decomposition may still 
exhibit multiple local minima for some problems, and depending on the initial guess, 
iterative methods may converge to different stationary points. Moreover, for certain 
tensors and certain values of R a best rank-i? approximation does not exist [55] . De- 
spite these difficulties, the exact CP decomposition has been shown to be unique up 
to scaling and permutation indeterminacies under mild conditions relating the ranks 
of the factor matrices with the tensor rank, and CP decompositions are used in many 
application fields [20] . 

The primary application of the CP decomposition is as a tool for data analysis, 
where it has been used in a variety of fields including chemometrics, data mining, 
image compression, neuroscience and telecommunications. A second class of problems 
is related to the decomposition of tensors arising from PDE discretization on high- 
dimensional regular lattices j27j EH [17\ [18] . Many algorithms have been proposed for 
computing the CP decomposition 20, 25][22l[3U[T]. However, the workhorse algorithm 
today is still the original alternating least squares (ALS) method, which was first 
proposed in 1970 in early papers on the CP decomposition [T2][T3]. The alternating 
least squares method is simple to implement, and often performs adequately, however, 
it can be very slow to converge, and its convergence may depend strongly on the initial 
guess. Despite its simplicity and potential drawbacks, it has proved difficult over the 
years to develop alternatives to ALS that significantly improve upon it in a robust way 
for large classes of problems. As a result, ALS-type algorithms are still considered 
the method of choice in practice. 
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The multigrid method described in this paper is intended to solve the first-order 
optimality equations for the CP optimization problem (1.3 1. It consists of two multi- 
level phases: a multiplicative correction scheme as the setup phase, and an additive 
correction scheme as the solve phase. In the setup phase a multiplicative correction 
scheme is used in conjunction with Bootstrap algebraic multigrid (BAMG) interpo- 
lation [6j I15j to not only build the necessary transfer and coarse-level operators, but 
also to compute initial approximations of the factor matrices. This phase uses the 
ALS method as the relaxation scheme on all levels. The setup phase is adaptive in 
the sense that the transfer operators are continually improved using the most recent 
approximation to the solution factor matrices. In order for the exact solution to be 
a fixed point of the multiplicative correction scheme, it needs to lie exactly in the 
range of interpolation at convergence. However, since each interpolation operator at- 
tempts to fit multiple factors (in a least-squares sense) this condition can be met only 
approximately. Therefore, after a few setup cycles we freeze the operators and use 
additive correction cycles in the solve phase, which can still converge when the exact 
solution lies only approximately in the range of interpolation. The combination of a 
multiplicative setup scheme and BAMG has already been considered in [21] , where it 
formed the basis of an efficient eigensolver for multiclass spectral clustering problems. 
A similar approach was also proposed in [T5J 13 • In the solve phase we use the Full 
Approximation Scheme (FAS) [5J [H] [TTJ [35] to efficiently obtain an accurate solution. 
Our multigrid framework is closely related to recent work on an adaptive algebraic 
multigrid solver for extremal singular triplets and eigenpairs of matrices [31j . and to 
a lesser degree to multigrid methods for Markov chains [35J [35] [3] . Note that in this 
paper we give a complete but concise description of the proposed algorithm. Readers 
who are not familiar with adaptive algebraic multigrid (AMG) methods can find more 
background information about the general adaptive AMG approach in, for example, 

EH EH [35]. 

The multigrid method proposed in this paper is expected to work well for tensors 
that have properties which make a multilevel approach beneficial, but not for generic 
tensors that lack these properties. Just as in the case of multigrid for matrix systems 
derived from PDE discretizations, our multilevel approach can lead to significant 
speedup when error components that are damped only weakly by the fine-level process 
can be represented and damped efficiently on coarser levels. We expect this to be the 
case for the decomposition of certain higher-order tensors that arise in the context of 
PDE discretization on high-dimensional regular lattices [27] [35] [T7] HH] , and we will 
illustrate the potential benefits of the proposed multigrid method for these types of 
problems. It should also be noted that the type of multigrid acceleration proposed in 
this paper will only be effective for low-rank decompositions with small R (e.g., up 
to 5 or 6) . This is because a single interpolation operator is associated with an entire 
factor matrix, and each interpolation operator can only be expected to represent a 
small number of factors in a sufficiently accurate way, especially if the desired factors 
have little in common. These restrictions are entirely analogous to the case of the 
adaptive multigrid method for computing SVD triplets of a matrix [31 . 

The remainder of this paper is structured as follows. In fj2] we define the basic 
notation and definitions used throughout this paper. Section [3] presents the first-order 
optimality equations and describes the alternating least squares method. Section [4] 
describes the multilevel setup phase, and ^describes the multilevel solve phase. Im- 
plementation details and numerical results are presented in S]6j followed by concluding 
remarks in S]7] 
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2. Notation and definitions. This section outlines the notation that is used 
throughout this paper, much of which has been adopted from [T] . We also review some 
basic definitions and identities that are important to this paper. For further details 
we refer to the survey paper by Kolda and Bader |20j . and the extensive references 
therein. 

Vectors (tensors of order one) are denoted by boldface lowercase letters, e.g., 
v. Matrices (tensors of order two) are denoted by boldface capital letters, e.g., A. 
Higher-order tensors are denoted by boldface Euler script letters, e.g., 2s. The ith 
entry of a vector v is denoted by Vj, element of a matrix A is denoted by ay, 
and, for example, element (i,j,k,l) of a fourth-order tensor 2s is denoted by Zijke- 
The jth column of a matrix A is denoted by a.j. The nth element of a sequence is 
denoted by a superscript in parentheses, e.g., A^ n K In general, indices range from 1 
to their capital versions, e.g., n = 1, . . . , N. 

Matricization, also referred to as unfolding or flattening, is the process of reorder- 
ing the elements of a tensor into a matrix. In this paper we are only interested in 
mode-n matricization, which arranges the mode-n fibers to be the columns of the re- 
sulting matrix. Note that a fiber is a higher-order analogue of matrix rows/columns, 
which is obtained by fixing every index of a tensor but one. Given a tensor 2s the 
mode-n matricized version is denoted by Z(„\. 

Matricization provides an elegant way to describe the product of a tensor by a 
matrix in mode n. The n-mode matrix product of a tensor 2s £ i 7lX '" x/i ' with a 
matrix A e R jxl ™ is denoted by 2sX„A and is of size I\X- ■ -X I„-iX Jx I n +i x - ■ - xI N . 
This product can be expressed in terms of unfolded tensors as follows 

X = 2s x n A X( n ) = AZ(„). 

For matrices A £ ~R IxK and B G M. JxK , their Khatri-Rao product results in a 
matrix of size (IJ) x K defined by 

A B = [ai ® bi • • • a K h K ], 

where is the Kronecker product. These products have many useful properties, 
however, we will only need the associativity of the Khatri-Rao product, and the 
mixed-product property of the Kronecker product, i.e., (A B)(C D) = AC BD. 
We can then easily prove the following useful identity for any sequences of matrices 
A'"' and B(™', n = 1, . . . , N, of the appropriate sizes 

A«B« • • • AWBW = (A« • • • AW) (B« • • • B W) . (2.1) 

Another useful relationship between tensors and their matricized versions is as follows. 
Let 2s G M J i x - x/ « and A («) e M J„x/„ f or n = i } . . . ) n. Then, for any n € {1, ... , N} 

X = 2sx 1 A«--- x N A^ X w =AWz (n) fAW®-®AW) T . (2.2) 

A proof of this property is given in [19j . 

To denote the product of a tensor and a sequence of matrices over some nonempty 
subset of the modes N = {m, . . . , Uk) C {1, . . . , N}, we use the following notation as 
shorthand 



2s x rieN AW - 2s x m ■ ■ ■ x nk A (n »\ 
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3. CP first-order optimality equations and alternating least squares. 

The first-order optimality equations for the CP decomposition are obtained by setting 



the gradient of the functional in (1.2 1 equal to zero. Following pQ, for each mode 
n e {1, . . . , N} the derivative of / with respect to can be written as an /„ x R 
matrix 



where 



and 



GW = -Z (n) #W + A^'rl"', (3.1) 

$(«) = A (N) . . . A (n+1) Q A (n-1) q ... q A (l) ) (3.2) 
r (n) = T (l) „,...„, y(n-i) * T («+i) # . . . + xW (3.3) 



with iW = A(") T A(") for n = 1,...,N. We note that since the R x R matrix 
rw is the Hadamard (element-wise) product "*" of symmetric positive semidefinite 
matrices, it too must be symmetric positive semidefinite |33) . Moreover, if each A^ n ) 
has full rank then will be symmetric positive definite (SPD). The first-order 
optimality equations are then given by 

G (n) =0, n = l,..., .AT. (3.4) 

These equations offer a simple way to describe the ALS method for the CP de- 
composition. One iteration of ALS is equivalent to applying one iteration of block 



nonlinear Gauss-Seidel (BNGS) to the optimality equations (3.4). Iterating through 
the modes sequentially, at the nth step the factor matrices are fixed for all modes ex- 
cept n, and the resulting linear least-squares problem is solved for A^™-*. In particular, 
r< n > and *W are updated and A^ <- Z (n) *(")(r( n ))t, where (rW)t is the Moore- 
Penrose pseudoinverse of T^ n \ Owing to the scaling indeterminacy inherent in the 
CP decomposition, it is possible that during ALS some factors may tend to infinity 
while others may compensate by tending to zero, such that the rank-one components 
remain bounded. This behavior can be avoided by using a normalization strategy. 
After each complete ALS iteration the factors of the rth component are normalized 
according to 

^^MrCh) for 7i = 1, . . . , iV, A r =(||aW||...||aW||) 1/Ar (3.5) 

\\\ a r 11/ 

for r — 1,...,R. This normalization equilibrates the norms of the factors of each 
component, i.e., ||ar || = ••• = ||a^ ||. The ALS algorithm described here is used 
as the relaxation method and coarsest-level solver in the setup phase (see Q. We 
note that upon completion of the ALS iterations the rank-one terms are sorted in 
decreasing order of the normalization factors A r . 

4. Multiplicative setup phase. In this section we describe the multilevel hi- 
erarchy constructed in the setup phase of our solver. We use two-level notation to 
describe the interaction of two grids at a time. Coarse grid quantities will usually be 
denoted by a subscript "c", except in cases where a superscript "c" improves read- 
ability. Fine grid quantities and intergrid transfer operators have neither subscripts 
nor superscripts. The multiplicative setup phase described here is similar in concept 
to the setup phase of [31] for computing the SVD of a matrix, and more details about 
this general concept can be found in [31 . 



() 



HANS DE STERCK AND KILLIAN MILLER 



4.1. Derivation of coarse-level equations. The fine-level equations are given 
by the gradient equations stated in [|3j i.e., 

Z (n) (AW • • • A(" +1 ) A< n-1 > • • • A' 1 )) = A^rW (4.1) 

for n = 1, . . . , AT. Suppose there exist AT full rank operators P( n ' € ^ J nXi„, Cj j < 
I n , such that lies approximately in the range of P'"', i.e., for each n, A^ » 

p(n) A ( n ) £ Qr some coarse-level variables A[™' 6 R'»' eXJi . We note that since each 
factor matrix has i? columns, it is unlikely that we can achieve equality. Then it follows 
that the solution to (4.1 1 can be approximated by solving a coarse-level problem 

p(") T Z (n) (p( w 'Af • • • p(«+DA(" +1 ) P^-Da^"- 1 ) • • • P«A«) 
= (p(«) T p(")) A^r£") for n = 1, . . . ,iV, (4.2) 



(n) 

followed by interpolation. Here, T c is defined as in (3.3) with 

T (n) = A («) T (p(n) T p(«)) A (n) { 0T n = 1, . . . , N. 



By property (2.1) the left-hand side of (4.2) can be written as 

P ( " )T Z ( „) fpW • • • P(™ +1 ) P^ 1 ) • • • P^) #W for n = 1, 



,iV, 



with given by (3.2). It then follows by property (2.2) that we can write the 

coarsc-lcvel tensor as 



Z c = Zx 1 Pj x 2 -pT ■■■x N P%, 



(4.3) 



which is essentially a higher dimensional analogue of the Galerkin coarse- level operator 
that is commonly used in multigrid for the matrix case. Defining B^") = pW PW 
for each mode n we can write the coarse-level version of (4.1 ) as follows 

Z\ n) (a^ • • • A(™ +1 > A^ • • • A«) = B("»AWrW. (4.4) 

By the full rank assumption on the interpolation operators it follows that B'™) is 
SPD, and hence we can compute its Cholesky factorization = LWlW , where 
IA n ) is an I nc x I n c nonsingular lower triangular matrix. The Cholesky factors can 
be used to transform (4.4), whereby one obtains an equivalent set of equations that 
correspond to the first-order optimality equations of a coarse-level CP minimization 
problem. Making the change of variables Ac™ = l/ n ) Ac™' for each mode n, and by 
appealing to property (2.1), it follows that (4.4) can be written as 

Z(„) (aw • • • A(™ +1 > A^"- 1 ) • • • A* 1 *) = A< n >r< n >, 

where 



L(" +1 )" 1 ®L("- 1 )" 1 (g)---«L«" n 



MULTIGRID FOR CANONICAL TENSOR DECOMPOSITION 



7 



and I^c = Tc™" 1 for n — 1,...,N. Moreover, by property ( |2.2[ ) the transformed 
coarse-level tensor is given by 



^x^^-x.pW' (4.5) 

where P^™* 1 = p( n )jj( n ) T for n = 1, . . . , N. Therefore, the coarse-level equations are 
equivalent to the gradient equations of the following coarse-level functional: 

; C (A« . . . , AW) := ±\\Z C - [A«, . . . , AW]| 2 . (4.6) 

Therefore, the coarse-level equations can be solved by applying ALS to minimize 
the coarse-level functional f c . Initial approximations of the transformed coarse- level 
variables may be obtained by applying a restriction operator PJ™- 1 to the current fine- 
level approximations. In this paper we follow the standard approach of defining the 
restriction operators as the transposes of the interpolation operators, i.e., 

R(n) _ p(n) T for n = 1, . . . , N. (4.7) 

After solving the coarse-level equations, the coarse-grid-corrected fine-level approxi- 
mations are obtained via prolongation: 

A&?c = P (n) A<"> itxn = l,...,N. (4.8) 

We note that if p( n *> contained A( n ) exactly in its range, and if the coarse-level 
equations were solved exactly, then the coarse-grid-corrected solution would equal the 
exact fine-level solution: 



» 



p(n) A («) _ P (n) L (n) T L (") T A( n > = P^'aW = A<"> for n = 1, . . • , N. 



However, since these conditions only hold approximately we expect (4.8 1 to yield an 
improved but not exact approximation to the fine-level solution. In particular, since 
the approximation properties of the interpolation operators deteriorate as the number 
of components is increased, we expect our method to perform well for a relatively small 
number of components R. 

4.2. Bootstrap AMG V-cycles. In this section we describe how we use Boot- 
strap AMG [TS] to find initial approximations of the desired factor matrices, and 
adaptively determine the interpolation operators that approximately fit the factor 
matrices. We follow the approach outlined in [21 ] [TB I [31 ] . 

We begin by describing the initial BAMG V-cycle. On the finest level we randomly 
choose n t test blocks, where each test block is a collection of N randomly generated 
test factor matrices (TFMs) A t (1) , . . . , A t W . The reason we must consider test blocks 
instead of simply adding more columns to the factor matrices is that rank-one com- 
ponents of the best rank-i? CP tensor approximation must be found simultaneously 
[20] : contrary to the best rank-i? matrix approximation, one cannot obtain the best 
rank-i? CP approximation by truncating the best rank-Q approximation with Q > i?. 
We also start with a collection of iV randomly generated boot factor matrices (BFMs) 
A^ 1 -*, . . . , Aj , which serve as our initial guess to the desired factor matrices. We 
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note that the subscripts "t" and "6" serve only to distinguish between the test and 
boot factors. 

In the downward sweep of the first BAMG V-cycle we relax on each of the test 
blocks separately and also on the BFMs using the ALS algorithm described in S|3] We 
then coarsen each mode on the finest level and determine the interpolation operators 
PW, . . . , PW. The nth interpolation operator p( n ) fits the factors in the nth TFMs 
across all test blocks in a least-squares sense, such that these factors lie approximately 
in the range of P'"' . We also build the coarse-level operator %> and restrict the TFMs 
and BFMs to the first coarse level. This process is then repeated recursively until some 
coarsest level is reached, from which point on we relax only on the BFMs. 

In the upward sweep of the first cycle, starting from the coarsest level, we re- 
cursively interpolate the BFMs up to the next finer level which gives the coarse- 
grid-corrected approximation on that level. We then relax the coarse-grid correction 
(CGC) using ALS. This process continues until the CGC on the finest level has been 
relaxed by ALS. 

The initial BAMG V-cycle can be followed by several additional BAMG V-cycles. 
These cycles are the same as the initial cycle except for one key difference. In the 
downward sweep the nth interpolation operator P^™' fits the factors in the nth TFMs 
across all test blocks as well as the factors in the nth BFM. Since the BFMs serve as 
our initial approximation for the additive phase of the algorithm, they must be well 
represented by interpolation if the additive solve phase is to converge. 

4.3. Interpolation sparsity structure: coarsening. Construction of the in- 
terpolation operators proceeds in two phases. In the first phase, the sparsity structure 
of P'' 1 ) is determined by selecting a subset of the fine-level indices fl n = {1, . . . , I n }. 
This subset, denoted by C„, is the set of coarse indices for mode n (it has cardinality 
In,c < I n )- Fine-level points that are not selected to the coarse level are represented 
by the set of fine indices J„ = £1„ \ G n . For each point i £ 3 n we define a set of coarse 
interpolatory points 6^, which contains coarse points that i interpolates from. For 
convenience we assume that the points in Q l n are labeled by their coarse-level indices. 
Furthermore, for any fine-level point i G C„ we let a(i) denote its coarse-level index. 
The interpolation operator P( n ) is defined by 

( w§\ and jee*„ 

Py = I 1, ieC,, and j = a(i) 
\ 0, otherwise, 

where the loj™ s are the interpolation weights for mode n. The interpolation weights 
are determined by a least-squares process described in §4.4| In this paper we use 
standard geometric coarsening for each mode, whereby C„ consists of the odd num- 
bered points in 17„, and J n consists of the even numbered points. It follows that 
a(i) = (i + l)/2 in this case. For each i € J„ we define G l n = {a(i — + 1)} 

(coarse-level labels) except possibly at the right endpoint. This coarsening works well 
when the modes have approximately the same size, however, for tensors in which the 
sizes of some modes vary widely, a more aggressive coarsening for the larger modes 
may be considered. In §4.5| we discuss a straightforward approach to coarsening ten- 
sors with varying mode sizes. While the simple coarsening procedure discussed here 
works well for the test problems considered in ^6] (PDE problems on high-dimensional 
regular lattices), more general coarsening algorithms for other types of tensors would 
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be desirable. The development of such algorithms remains an interesting topic of 
future research. 

4.4. Least-squares determination of interpolation weights. Suppose that 
mode n has been coarsened and that C„ and 3>, are given. Further suppose that the 
factors in the nth test factor matrices across all test blocks are stored as the columns 
of the I n x Rn t matrix U t , and let U& = for the boot factor matrices. Following 
the approaches of [5J HTJ [T51 [3T] we use a least-squares (LS) process to determine the 
interpolation weights in the rows of that correspond to points in SF n . We want 
to fit the weights of PW such that the vectors in Ut and Ub (except in the first 
cycle) lie approximately in the range of P™. Let the columns of Uy = [U t | U&] hold 
the rit = R(n t + 1) vectors to be fitted. Let be the fcth column of XJf. Let u& c 
be the coarse-level version of obtained by injection, and let (uk, c )j be its value 
in the coarse-level point j. Also, let Uik be the value of U& in the fine-level point i. 
The interpolation weights of each row that corresponds to a point in J n may now be 
determined consecutively by independent least squares fits. Consider a fixed point 
i G 3 n with coarse interpolatory set G l n . We solve the following least-squares problem 
to determine the unknown interpolation weights , 

Uik = ^2 w ij ( u *>c)j for A; = 1, ... , n f . (4.9) 



We make (4.9) over-determined in all cases by choosing n t > M s /R, where M s is the 
maximum interpolation stencil size for any i on any level, i.e., |CJJ < M s . Owing to 
the standard geometric coarsening of each mode M s = 2, and so it is sufficient to use 
n t — 2 for any number of components R > 1. For a rank-one decomposition we must 
take n t > 3. 



In practice it is necessary to formulate (4.9) as a weighted least-squares problem 



where the weights should bias the fit toward the boot factors. We do so as follows. For 



a fixed n, we compute the gradient according to (3.1 ) using the factor matrices 
in the current test block. The weights w r for the mode-n factor vectors are then 
defined as 



|Aj n) |l 
|G(«)|| 



for r = 1,. .. ,R. (4.10) 



Note that for a given mode n and test block all factor vectors have the same weights. 
The weights for all test blocks are stored in the vector w t of length Rnt- The weights 
for the boot factors are computed similarly and are stored in the vector Wf, of length 
R. The full vector of weights w £ M. n f is obtained by "stacking" w t on top of 



Equation (4.10) stems from the observation that G'"J is a residual for the nth factor 
matrix. Therefore, since the BFMs should converge much faster than the TFMs, 
the gradient norm for the BFMs should be much smaller, and hence the weights 
should be larger. We note that weights corresponding to a single factor matrix are 



chosen identical in ( 4.10 1 since we do not want preferential treatment given to different 
factor vectors, but rather to entire factor matrices. In our implementation the small 
weighted least-squares problems with diagonal weight matrix W = diag(w) are solved 
by a standard normal equations approach. 

4.5. Pseudocode. A pseudocode description for a multiplicative setup phase V- 
cycle with ALS as the relaxation scheme and coarsest-level solver is given by Algorithm 
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[T] For the unfamiliar reader we note that a multigrid V-cycle is obtained by performing 
one recursive solve on each level. If two recursive solves are performed on each level 
then a multigrid W-cycle is obtained. In this paper we only consider V-cycles for the 
setup phase. 

Algorithm 1: V-cycle for setup phase of CP decomposition (CP-AMG-mult) 

Input: tensor Z, BFMs A^\ . . . , A^, TFMs 
Output: updated BFMs A^\ . . . , AW, updated TFMs 

1. Compute the set 3 C = {n : I n > I n , CO arsest} 
if J c ^ then 

2. Apply V\ relaxations to TFMs in each test block and to A^ 1 ', . . . , PS N ^ 
for n € 3 C do 

3. Build the interpolation operator p( n ) (on first cycle only use TFMs) 

4. Let BW 4- p(") T p(«) 

5. Compute the Cholesky factorization B^ 1 ) = L' n 'LW 

6. Let p( n ) <- p(™)L(") _T and R» <- pO0 T 
end 



Compute the coarse BFMs and coarse TFMs according to (4.11 ) 

Compute the coarse-level tensor Z <— Z x ne ;j c 
Recursive solve: 

{A* 1 ', . . . , AW} <- CP-AMG-mult ( Z c , AW, ... , A[ N \ coarse TFMs) 
Compute the CGC A 00 for n = l,...,N according to (|4. 12 ) 



10. 

n. Apply V2 relaxations to A^- 1 ', . , . 5 
else 

12. Apply v c relaxations to . . . , A^) 
end 



The CP-AMG-mult algorithm recursively coarsens each mode until it reaches 
some predefined coarsest level. However, since the size of each mode may 
differ, it is possible that some modes may reach their coarsest level sooner than others. 
We address this issue as follows. For each mode n we define a threshold I n ,coarsest to 
be the maximum size of that mode's coarsest level. For any mode n such that I n > 
In.coarsest we continue coarsening, otherwise we do not coarsen that mode further. Let 
the modes that still require further coarsening be indexed by the set 3 C = {n : I n > 
In.coarsest}, and let J' c denote its complement. Then at any given level for each n € "J' c 
it follows that p(") = where is the 7„ x /„ identity matrix. Setting pW 
equal to the identity for all n G 1' c has the following implications. The coarse-level 



tensor can be computed by taking the product in (4.5 1 over the modes in J c , instead 



of for all n = 1, . . . , N. The coarse- level approximations of the BFMs are given by 

A^O = \ ' for n = 1,...,N. 4.11 

1 A<">, net V ' 



Similarly, the coarse-level approximations of the TFMs in each test block are com- 
puted by restricting only those factor matrices indexed by J c . Additionally, the coarse- 



MULTIGRID FOR CANONICAL TENSOR DECOMPOSITION 



11 



grid-corrected BFMs are given by 

A^ c ={ Un) hvn = l,...,N. 4.12 

The size of the coarsest level plays an important role in how the multigrid method 
performs. If the coarsest level is too large then not enough work is done on the coarser 
levels and convergence will be slow. Conversely, choosing too small a coarsest level 
may negatively impact convergence, or in some cases may even cause divergence (as 
in [211 [HI [31] ) . In practice we find that choosing I n .coarsest > R for all n works well. 

5. Full approximation scheme additive solve phase. The Full Approxi- 
mation Scheme (FAS) [5] is the nonlinear analogue of the linear additive correction 
multigrid method. When applied to linear problems FAS reduces to the usual additive 
method, and so it is a more general multigrid solver. In this section we discuss how 
FAS can be used to obtain an additive correction method for the CP decomposition. 
Two-level notation is used to describe the interaction of two grids at a time with 
coarse-grid quantities denoted by a subscript "c" . 



5.1. Coarse-level equations. Recall the finest-level equations (4.1 ), which can 
be expressed as 

A (n) r (n) _ Z(n) ^ A (JV) . . . q A (n+1) Q A («-l) q...q A UA = Q 

for rt = 1, ... , N. Suppose we define nonlinear operators HW, . . . , such that for 
any nG {1,...,A} 

H (n) . |/ixii x . . . x R I N xB. _^ R I n xR 

where 

H^({A}) := A< n 'rW - Z (n) (a'"' • • • A<™ +1 ) A^"" 1 ) • • • A^) . (5.1) 



Note that we use {A} as shorthand for A^, . . . , A^. Then the fine-level problem 
can be formulated as a system of nonlinear equations 

H({A}) := (H«({A}), . . . , HW({A})) = (F^, . . . , F^), (5.2) 

where = for n = l,...,N on the finest level. In order to apply the full 

approximation scheme we require a coarse version of (5.2 1. For each mode n we 
define the coarse operator 

H« ({A c }) := A< n 'rW - Z c {n) ... © A (" +1 ) At™" 1 ' • • • A«) , (5.3) 

where Z c is the coarse-level tensor computed in the multiplicative setup phase. Then 
the coarse-level FAS equations are given by 

H C ({A C }) := (H«({A C }), . . ,Hf)({A c })) = (F« . . . , pW) (5.4) 

where 

FW = R(")(p(") - HW({A})) +HW({A c }) for n = 1,...,N, (5.5) 
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and R,W is the mode n restriction operator from the multiplicative setup phase. Here 



Ac is the coarse-level approximation of A( n ) obtained by restriction. Solving (5.4 1 
for {A c }, the coarse- grid-corrected approximations on the fine level are given by 



A&c = A^ + P (n) (A< n > - AW) for n = 1, . 



.N, 



(5.6) 



where p( n ) is the interpolation operator from the multiplicative setup phase. To- 



gether, Equations (5.2) and (5.4 1 to (5.6| describe a FAS two-level coarse-grid correc- 
tion scheme for the CP optimality equations. In the following sections we describe 
the relaxation scheme used in the solve phase, and give a pseudocode description of 
the multilevel CP-FAS algorithm. 

5.2. Relaxation. We employ block nonlinear Gauss-Seidel (BNGS) as the re- 
laxation scheme and coarsest-level solver for the CP-FAS algorithm (Algorithm [2]). 
Applying BNGS to the equations in (5.2 1 is similar to applying ALS to the CP opti- 
mality equations. One iteration consists of iterating through the modes sequentially, 
where at the nth step and are computed, and then A( n ) is updated by 

solving 



z (n) *w 



(5.7) 



When considering how to solve (5.7 1 for mode n, on any level, we recall the following 
fact: the exact solution is a fixed point of FAS if it is a fixed point of the relaxation 
scheme and the coarsest-level solver. Suppose we update A^™- 1 by post-multiplying 
the right-hand side of (5.7) by (r' n ')t ) which is a small R x R matrix. If is 



nonsingular then its pseudoinverse is equivalent to its inverse, in which case there 
exists a unique solution and the fixed point is preserved. However, if T^™' is singular 
then post-multiplying by the pseudoinverse will in general not preserve the fixed point. 
Therefore, we consider an alternative approach. We propose using a few iterations of 
Gauss-Seidel (GS) to update A'"', which guarantees the fixed point property of our 
relaxation method. Moreover, a result by Keller |16) for positive semidefinite matrices 
states that if r( n ) has nonzero entries on its diagonal then GS must converge to a 
solution (there may be many) of (5.7). Owing to the structure of this condition 



is equivalent to the fundamental requirement that the factor matrices have nonzero 
columns. Therefore, we can be confident that GS will converge regardless of whether 
or not fW is singular. In practice we find that only a few GS iterations are necessary 



to obtain a sufficiently accurate solution to (5.7), and that further iterations do little 



to improve the relaxed approximation. In this paper we use ten GS iterations. 

After each iteration of BNGS on the finest level the factor matrices arc normalized 



according to (3.5 ). Due to the structure of the FAS equations, in particular the right- 



hand side in (5.2), the scaling indeterminacy is not present on the coarser levels and so 



normalizing there is unnecessary. We note that the permutation indeterminacy is also 
removed on the coarser levels because of the (nonzero) right-hand side. Therefore, 
the rank-one terms are sorted in decreasing order of the normalization factors A r only 
on the finest level. 

5.3. CP-FAS algorithm. A pseudocode description for an additive solve phase 
V-cycle is given in Algorithm [2] While other cycling schemes such as W-cycles and 
F-cycles [35] are available to us, in this paper we use V-cycles. We assume that at 
any given level the current tensor, the index set J c , and the interpolation/restriction 
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operators from the setup phase are available to the algorithm. Note that the param- 
eters (it, v c) may be different from those used during the setup phase. 

Algorithm 2: V-cycle for solve phase of CP decomposition (CP-FAS) 

Input: right-hand side matrices {F}, factor matrices {A} 
Output: updated factor matrices {A} 

if not on the coarsest level then 

1. Apply vi relaxations to H({A}) = (F^, . . . , F'"') 

2. Coarse initial guess: 

A< n > <- I . ' n=l,...,N 
\ AW, ne3' c 

3. Coarse right-hand side: 

F(n)< _ f Hi" ) ({A c })+R(»)(F(»)-H(«)({A})), neJ c r=1 n 

4. Recursive solve: 

{A« . . . , A W} <- CP-FAS({F C }, {A c }) 

5. Coarse-grid correction: 

M , , [ p(")(Ai n) -Ai n) ), ne 3 C 

A (n)^ A (n) + i \ o J, n=l,...,N 

[aW-aW, nej; 

6. Apply v 2 relaxations to H({A}) = (F^, . . . , F'"') 
else 

7. Apply v c relaxations to H({A}) = (F^ 1 ), . . .,PW) 
end 

It is instructive to mention the differences of this additive solution phase and the 
additive phase in the SVD case [3T]. For the SVD, singular vectors can be computed 
in separate V-cycles and FAS is not required since the singular values are updated 
in a top-level Ritz step. In the tensor case, all factor vectors need to be computed 



simultaneously in a single FAS V-cycle, and the weights A r from (3.5), which are in 
some sense equivalent to the singular values, are updated in these FAS cycles as well, 
making a Ritz step unnecessary. 

5.4. Full multigrid FAS cycles. For some problems the initial guess provided 
by the multiplicative setup phase may be too far from the solution to yield a convergent 
additive solve phase. One way in which we can try to obtain a better initial guess 
to the fine-level problem is to use Full Multigrid (FMG) [HEUIH]. Full multigrid is 
based on nested iterations whereby coarse levels are used to obtain improved initial 
guesses for fine-level problems. At any given level the problem is first solved on 
the next coarser level after which the solution is interpolated to the current level to 
provide a good initial guess. This process naturally starts at the coarsest level and 
terminates at the finest. Once an initial guess to the finest-level problem has been 
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obtained we can apply repeated CP-FAS cycles to obtain an improved approximate 
solution. We use CP-FAS V-cycles as the solver on each level of the FMG cycle, 
except on the coarsest level where ALS is used (see J3l) . A pseudocode description of 
the FMG-CP-FAS algorithm is given in Algorithm [3j We assume that at any given 
level the current tensor, the index set 3 C , and the interpolation/restriction operators 
from the setup phase are available. In Algorithm [3] we use a subscript I to index the 
current level, where i = 0, . . . , L — 1. Note that a subscript £ on an interpolation 
operator indicates that level I is mapped to level l—l. 

Algorithm 3: Full multigrid cycle for solve phase of CP decomposition (FMG- 
CP-FAS) 

Output: finest-level factor matrices A.^\ . . . , Aq^' 

1. On the coarsest level apply v iterations of ALS with a random initial guess to 
obtain A^, . . . , a£j>j 

2. Set t <r- L - 1 
while £ ^ do 

3. <- P< n) A< n) forn = l,...,A 

4. . . ,A<f|} <- CP-FAS({0}, Af2 v . . . , A^j) 

5. i^l- 1 

end 



6. Implementation details and numerical results. In this section we present 
the results of numerical tests. All experiments are performed using MATLAB version 
7.5.0.342 (R2007b) and version 2.4 of the Tensor Toolbox 0. Timings are reported 
for a laptop running Windows XP, with a 2.50 GHz Intel Core 2 Duo processor and 
4GB of RAM. Initial guesses for the boot factors and test factors are randomly gen- 
erated from the standard uniform distribution. The initial boot factors are also used 
as the initial guess for the standalone ALS method. The stopping criterion for the 
numerical tests is based on the gradient of /. In particular, with 

v 1/2 

J(A") AW) = ^| V||G(")|| 2 , (6.1) 




where is the mode-n partial derivative of / as defined in (3.1 1, we iterate until 



g(A (1 »,.,A< N »)<r, (6.2) 

or until the maximum number of iterations are reached. In this paper we perform at 
most 500 iterations of our multilevel method, and at most 10 4 iterations of ALS. The 



stopping tolerance is fixed at r = 10~ 10 . Table 6.1 lists the parameters used by the 
setup and solve phases. As in |21 [ 115 1 I^Tj- a larger number of relaxations is required 
in the setup cycles to produce sufficiently accurate transfer operators. 

For each numerical test we perform ten runs with a different random initial guess 
for each run. The values reported in the tables represent averages over the successful 
runs, where a run is deemed successful if the stopping criterion is satisfied prior to 
reaching the iteration limit. The tables compare the ALS method and the multilevel 



method with or without FMG-CP-FAS as part of the setup phase (see [6.1). For 
ALS we report the average number of iterations, the average execution time and the 
number of successful runs. For the multilevel method we report the average number 
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Table 6.1 
CP-AMG-mult and CP-FAS parameters 



Parameter 


CP-AMG-mult 


CP-FAS 


Pre-relaxations v\ 


5 


1 


Post-relaxations v% 


5 


1 


Relaxations on coarsest level u c 


100 


50 


Cycle type 


V-cycle 


V-cycle 


Number of test blocks nt 


2 


n/a 



of iterations (setup and solve phases), the average total execution time, the number 
of successful runs, the average speedup over ALS and the number of levels. The 
average speedup is determined as follows. For a given test and run, if both ALS and 
the multilevel method were successful, we divide the execution time of ALS by the 
execution time of the multilevel method to obtain the speedup for that run. These 
values are then averaged to obtain the average speedup for that test. We note that 
execution times do not include the evaluation of the stopping criterion. 

6.1. Implementation details. The multilevel setup and solve phases have thus 
far been described separately, however, these phases can be combined in the following 
simple way. Since the factor matrices lie only approximately in the range of the 
interpolation operators, convergence of the setup cycles, as measured by the functional 
g, should stagnate after a few iterations. Therefore, after each setup cycle the current 
iterate {A new } is compared to the previous iterate {A^} and the setup cycles are 
halted once 

g({A new }) > (l-s)g{{A old }), (6.3) 
where the tolerance is set at e — 0.1. Moreover, at most five setup cycles are per- 



formed, and (6.3 1 is checked only after three setup cycles have completed. Once 



the setup phase has completed, solution cycles are performed until the stopping cri- 



terion (6.2) is satisfied. To improve robustness, we can also try to detect stagna- 
tion of the solution cycles. After five solution cycles have completed, we check if 
g({A new }) > g({A id}) in each subsequent iteration. If this condition is satisfied 
then the current iterate {A new } is discarded and the transfer operators are rebuilt by 
one down-sweep of CP-AMG-mult with the previous iterate {A w} used for the boot 
factors. This process is carried out at most once, and any further indications of stag- 
nation are ignored. Note that the boot factors are not updated by the down-sweep of 
CP-AMG-mult, as doing so would ruin any progress made by the solution cycles. 

The combination of the setup and solve phases described above can be modified to 
include an FMG-CP-FAS cycle as part of the setup phase. After the setup cycles have 
completed, we perform one FMG-CP-FAS cycle to compute a new approximation to 
the boot factors. The transfer operators are then rebuilt using one down-sweep of 
CP-AMG-mult. Note that while the TFMs are updated by this process, the boot 
factors are not. We refer to this combination as 'Multilevel + FMG' in the tables and 
figures. 

We conclude this section by considering the computational costs of one setup 
cycle, one solution cycle, and one FMG cycle. Let I = 0, . . . , L — 1 index the levels, 
and define 

N N N N 

r e — — p e — 17 r e - — - 17 t c'-Vr'-iv? 
1 n — 2 <" LL n ~ 2 Ne '~ 2— 1 ~~ 2 e 71 

n—1 n—1 n=l n—1 
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for any £ > 0. Assume for simplicity that Z> is dense and that each mode is coarsened 
at the same rate with L being the same for each mode. Consideration of Algorithm 
[T] shows that the most expensive operations on each level are the construction of 
the coarse-level operator, the relaxations, and the construction of the interpolation 
operators, in particular computing the weights for the least-squares fits. The coarse- 
level tensor is constructed by sequentially taking the n-mode product of the current 
tensor with the nth restriction operator for n = 1,...,N. Computing 2> on level 
£ requires 0(F S ) operations. The dominant computation for the relaxations and 
least-squares weights is the matrix product Z( n )<&w. Since Z(„) is of size l£ x 
(P £ /I n ) and is of size (P /I n ) x i? on the £th level, forming this product requires 
0(NP l R) operations. Therefore, by summing over all the levels, to leading order one 
setup cycle requires approximately 



(n t (vi + 1) + ui + v 2 + 1) + 



2 JV(i-i) 

2 N 



2 N -1 



O(NPR) 
O(PS) 



(6.4) 



operations, where P = P° and 5 = 5°. We note that PS scales only slightly worse 
than linear in P, and in particular NPI m i n < PS < NPI max where I m i n and I m ax are 
the sizes of the smallest and largest modes, respectively. Consideration of Algorithm 
[2] shows that the most expensive operations on each level are the relaxations and 
the construction of the right-hand sides. By a similar analysis, to leading order one 
solution cycle requires approximately 



2 N - 1 



{vi + v 2 + 1 + l/2 N ) + 



2 N(L-1) 



O(NPR) 



(6.5) 



operations. Similarly, it follows that to leading order one FMG-CP-FAS cycle requires 
approximately 



1 



(z/ 1 +^ 2 + l + l/2 iv ) 



N v + (L - l)v c 



2 N(L-1) 



O(NPR) (6.6) 



operations. Note that in (6.6) v is the number of ALS iterations performed on the 
coarsest level and (yi, v>2, v c ) are the CP-FAS parameters. In general a solution 
cycle is significantly cheaper than a setup cycle because of the extra work required 
by a setup cycle to relax on the TFMs, the typically larger number of relaxations 
performed on each level of a setup cycle, and the added work of constructing the 
coarse-level tensors (i.e., the O(PS) term). If 2> is sparse then further savings are 
possible on the finest level. In particular, to leading order the cost of one relaxation 
reduces to NR times the number of nonzero elements in Z. In our current framework 
the coarse tensors will in general be dense; multiplication by the inverted Cholesky 



factors as in (4.5) destroys any sparsity. Therefore, it may be interesting to consider 



alternative formulations of the coarse-level equations, for example, working directly 



with equations of the form in (4.4); see also [21] 



6.2. Sparse tensor test problem. The first test problem we consider is the 
standard finite difference Laplacian tensor on a uniform grid of size s d in d dimensions. 
This test problem yields an A^-mode sparse tensor % of size s x sx ■ ■ ■ x s with N = 2d. 
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We can efficiently construct %> by reshaping the s d x s d matrix 

d 

Z = ^2h(k) 0D(8)I r(A:) , 

k=l 

where l r /f-) is the s fc_1 x s k ~ 1 identity matrix, Ie(k) is the s d ~ k x s d ~ k identity matrix, 
and D is the sx s tridiagonal matrix with stencil [—1, 2, —1]. While this test problem 
is somewhat pedagogical in nature, it offers a good starting point to illustrate our 
method. The parameters for the various test tensors that we consider are given in 
Table [Ol 

Table 6.2 
Parameters for sparse problem. 



test problem parameters 



1 


N 


= 4, 


s = 20, R = 


4 


2 


N 


= 4, 


s = 20, R = 


5 


3 


N 


= 4, 


s = 20, R = 


6 


1 


N 


= 4, 


s = 50, R = 


2 


5 


N 


= 4, 


s = 50, R = 


3 


6 


N 


= 4, 


s = 50, R = 


4 


7 


N 


= 6, 


s = 20, R = 


2 


8 


N 


= 6, 


s = 20, R = 


3 


9 


N 


= 6, 


s = 20, R = 


4 


10 


N 


= 8, 


s = 10, R = 


2 


11 


N 


= 8, 


s = 10, R = 


3 



Table 6.3 

Sparse problem. Average number of iterations and time (in seconds) until the stopping criterion 
is satisfied with stopping tolerance 10 -10 . Here 'it' is the number of iterations, 'spd' is the multilevel 
speedup compared to ALS, 'ns' is the number of successful runs, and 'levs' is the number of levels. 



ALS Multilevel Multilevel + FMG 



test 


it 


time 


us 


it 


time 


spd 


us 


it 


time 


spd 


us 


levs 


1 


1897 


24.0 


10 


37 


7.6 


3.2 


10 


36 


8.0 


3.1 


10 


2 


2 


3329 


53.6 


8 


64 


15.1 


4.5 


10 


42 


11.8 


5.1 


10 


2 


3 


3587 


70.3 


9 


67 


17.5 


4.0 


9 


32 


10.6 


6.8 


10 


2 


4 


5457 


105.9 


10 


123 


40.5 


2.7 


10 


120 


41.2 


2.6 


10 


4 


5 


5508 


150.3 


1 


182 


64.9 


2.3 


9 


99 


40.9 


3.8 


9 


4 


6 


6788 


244.0 


3 


150 


62.0 


5.2 


10 


136 


58.7 


5.1 


9 


4 


7 


1619 


187.0 


10 


48 


111.4 


1.7 


10 


52 


128.3 


1.5 


10 


3 


8 


3481 


610.0 


10 


72 


164.8 


3.7 


10 


70 


177.9 


3.5 


10 


3 


9 


4085 


939.5 


10 


76 


209.6 


4.5 


10 


78 


228.9 


4.2 


10 


3 


10 


634 


229.5 


10 


48 


170.8 


1.3 


10 


56 


203.5 


1.1 


10 


3 


11 


1743 


943.3 


10 


39 


402.3 


2.3 


10 


4.3 


474.0 


2.0 


10 


3 



The results of the tests for the sparse test case are given in Table 6.3 The results 
show that our multilevel approach is anywhere from two to seven times faster than 
ALS for this test problem. For tests 1 to 6 (order 4 tensors), larger speedups are 
observed for the multilevel method with FMG. However, for tests 7 to 11 (order 6 and 
8 tensors) larger speedups are observed for the multilevel method without FMG. For 
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higher-order tensors the setup phase of multilevel method with FMG is considerably 
more expensive than the setup phase of the multilevel method without FMG. The 
multilevel variants demonstrate similar robustness to varying initial guesses for this 
problem, however, in general we expect the multilevel method with FMG to be the 
most robust option. We also observe the trend that for each grouping of tests in Table 
|6.3] the speedup tends to increase as the number of components R increases. 

Figures [6.1| and [672] illustrate the convergence history of ALS and the multilevel 
method for one run of tests 3 and 9 in Table |6.3| respectively. These plots are 
typical of the performance observed for this test problem. We note that the spike 
in the 'Multilevel + FMG' curves is due to the initial approximation to the solution 
computed by the single FMG-CP-FAS cycle performed after the setup cycles. 




25 50 75 100 30 60 90 120 

Iterations Time (sec) 



Fig. 6.1. Sparse problem. Convergence plot for test 3 from Table \6J^ (N = 4, s = 20, R = 6). 
The solid black line is ALS, the solid gray line is the multilevel method with FMG, and the dashed 
line is the multilevel method without FMG. 

6.3. Dense tensor test problem. The second test problem we consider is a 
dense, symmetric third-order tensor %> £ ^xsxs wnose elements arc given by 

Zijk = {i + 3 + k > tor i,j,k = 1, ...,s. 

This tensor was used as a test case in [27| , which compares various methods for com- 
puting the CP decomposition including ALS. It was also considered in [21], which 
describes a novel method for computing the Tucker decomposition of third-order ten- 
sors. As mentioned in [28], 2> arises from the numerical approximation of an inte- 
gral equation with kernel l/||x — y|| acting on the unit cube and discretized by the 
Nystrom method on a uniform grid. In this section we compute CP decompositions 
for R = 2, 3, 4, 5. It has been observed numerically that when R > 4, ALS may 
be extremely slow to converge, for some initial guesses requiring on the order of 10 5 
iterations, with highly non-monotonic convergence behavior. The performance of our 
method when R > 4 is less robust than desired because the multigrid framework uses 
a single interpolation operator for each factor matrix. Even so, depending on the 
initial guess our method may still demonstrate a significant improvement over ALS. 
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Table 6.4 

Dense problem. Average number of iterations and time (in seconds) until the stopping criterion 
is satisfied with stopping tolerance 10 — 10 . Here 'it' is the number of iterations, 'spd' is the multilevel 
speedup compared to ALS, 'ns' is the number of successful runs, and 'levs' is the number of levels. 



ALS Multilevel + FMG 



test 


problem parameters 


it 


time 


ns 


it 


time 


spd 


ns 


levs 


1 


s= 50, R = 2 


161 


0.7 


10 


7 


2.2 


0.3 


10 


5 


2 


s= 50, R = 3 


2435 


11.8 


10 


15 


2.7 


4.7 


10 


5 


3 


s= 50, R = 4 


4838 


26.1 


5 


97 


10.1 


4.1 


7 


1 


4 


s = 50, R = 5 








301 


30.0 




3 


1 


5 


s = 100, R = 2 


253 


10.7 


10 


7 


7.9 


1.4 


10 


6 


6 


s = 100, R = 3 


1695 


80.2 


9 


9 


8.1 


10.1 


10 


6 


7 


s = 100, R = 4 


3836 


202.2 


6 


82 


27.5 


14.1 


9 


5 


8 


s = 100, R = 5 


7854 


455.2 


2 


192 


62.0 


9.0 


4 


5 


9 


s = 200, = 2 


274 


90.3 


10 


7 


48.8 


1.9 


10 


7 


10 


s = 200, R = 3 


1830 


682.3 


10 


12 


61.7 


11.2 


10 


7 


11 


s = 200, R = 4 


2998 


1249.5 


8 


79 


178.0 


11.6 


9 


6 


12 


s = 200, R = 5 


5686 


2611.4 


3 


220 


440.7 


2.6 


1 


6 



The results of the tests for the dense test case are given in Table |6.4| We note 
that only the multilevel method with FMG is considered for this test problem (see the 
description in { 6.1 1. The blank entries for test 4 in Table 6.4 indicate that ALS did not 
have any successful runs. For R > 3 our multilevel approach can lead to significant 
savings in iterations and execution time. The speedup is less impressive when R = 
2, since ALS already converges quickly without any multigrid acceleration. It is 
also apparent that as the number of components increases, the number of successful 
runs of the multilevel method, and of ALS, decreases. For initial guesses in which 
the multilevel method failed to converge, there was typically a rapid decrease in 
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the gradient norm, followed by convergence stagnation of the solution cycles. This 
behavior suggests that the setup phase was unable to construct transfer operators that 
adequately represented the solution in their range. Such cases were also characterized 
by slow convergence of ALS. 




5 10 15 20 15 30 45 60 



Iterations Time (sec) 

Fig. 6.3. Dense problem. Convergence plot for test 6 from Table \0\4\ ( s = 100, R = 3). The 
solid black line is ALS and the solid gray line is the multilevel method with FMG. 




10 20 30 40 25 50 75 100 

Iterations Time (sec) 



Fig. 6.4. Dense problem. Convergence plot for test 7 from Table \b\4\ (s = 100, R = i). The 
solid black line is ALS and the solid gray line is the multilevel method with FMG. 

Figures |6.3| |6.4| and |6.5| illustrate the convergence history of ALS and the mul- 
tilevel method for one run of tests 6, 7 and 8 in Table [6T4J respectively. Figure [675| 
(R = 5) shows how ALS can initially be slow to converge with erratic convergence 
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Fig. 6.5. Dense problem. Convergence plot for test 8 from Table \6\4\ ( s = 100, R = 5). The 
solid black line is ALS and the solid gray line is the multilevel method with FMG. 

behavior: for the first half of the run its gradient norm fluctuates with little decrease. 
Such behavior can make it very difficult for the setup phase to construct adequate 
transfer operators. 

7. Concluding remarks. We have presented a new algorithm for computing 
the rank-i? canonical decomposition of a tensor for small R. As far as we are aware 
our method is the first genuine multigrid algorithm for computing the CP decompo- 
sition. Our work is also significant in that it presents the first adaptive AMG method 
for a nonlinear optimization problem. Similar to the method presented in |31| for 
computing SVD triplets of a matrix, we combined an adaptive multiplicative setup 
phase with an additive solve phase. The ALS method was used as the relaxation 
scheme. Numerical tests with dense and sparse tensors of varying sizes and orders 
(up to order 8) that are related to PDE problems showed how our multilevel method 
can lead to significant speedup over standalone ALS when high accuracy is desired. 

Avenues of further research are plentiful. For example, it may be worthwhile to 
investigate a more sophisticated setup phase that iteratively combines the CP-AMG- 
mult cycles and CP-FAS-FMG cycles. As discussed briefly in |6.1| an alternative 
formulation of the coarse-level equations without the inverted Cholesky factors may 
be fruitful for sparse problems. In addition to PDE-related tensors, there may be 
other classes of tensors for which multigrid acceleration of ALS may be beneficial, 
but identifying and studying such classes remains a topic of future research. It would 
also be interesting to consider other ALS-type methods for the relaxations, for ex- 
ample, those using line searches, as well as to apply our method to the regularized 
optimization formulation of CP as described in pQ. Similarly, it would be interesting 
to generalize our multilevel framework to other similar tensor optimization problems 
such as the Tucker decomposition [20], block tensor decompositions [23l|26], best rank- 
(i?i, . . . , Rn) approximations O [14], and to other nonlinear optimization problems. 
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